ABSTRACT

The area of automatic speaker recognition has been dominated
by systems using only short-term, low-level acoustic
information, such as cepstral features. While these systems have
indeed produced very low error rates, they ignore other levels of
information beyond low-level acoustics that convey speaker
information. Recently published work has shown that
such high-level information can be used successfully in
automatic speaker recognition systems and has the potential to
improve accuracy and add robustness. For the 2002 JHU CLSP
summer workshop, the SuperSID project
(http://www.clsp.jhu.edu/ws2002/groups/supersid/) was
undertaken to exploit these high-level information sources and
dramatically increase speaker recognition accuracy on a defined
NIST evaluation corpus and task. This paper provides an
overview of the structure, data, task, tools, and accomplishments
of this project. Wide-ranging approaches using pronunciation
models, prosodic dynamics, pitch and duration features, phone
streams, and conversational interactions were explored and
developed. In this paper we show how these novel features and
classifiers indeed provide complementary information and can be
fused together to drive down the equal error rate on the 2001
NIST extended data task to 0.2%, a 71% relative reduction in
error over the previous state of the art.

1.  INTRODUCTION 

What is it in the speech signal that conveys speaker identity?
This is one of the central questions addressed by automatic
speaker recognition research. From self-observation and
experience, it is pretty clear that we (humans) rely on several
different types or levels of information in the speech signal to
recognize others from voice alone. These can be the deep bass
and timbre of a voice, a friend's unique laugh, or the particular
repeated word usage of a colleague. Roughly, we can categorize
these into a hierarchy running from low-level information, such
as the sound of a person's voice, related to physical traits of the
vocal apparatus, to high-level information, such as particular
word usage (idiolect), related to learned habits and style. While
all of these levels appear to convey useful speaker information,
automatic speaker recognition systems have relied almost
exclusively on low-level information via short-term features
related to the speech spectrum. With the continual advancement
of tools, such as phone and speech recognition systems, to
reliably extract features for high-level characterization, the
increase in applications (like audio mining) allowing for
relatively large amounts of speech from a speaker to learn
speaking habits, and the availability of large development corpora
and plentiful computational resources, the time is right for a
deeper exploration into using these underutilized high-level
information sources. These new sources of information hold the
promise not only of improvement in basic recognition accuracy
by adding complementary knowledge, but also of
robustness to acoustic degradations from channel and noise
effects, to which low-level features are highly susceptible.
Furthermore, previous work examining certain high-level
information sources has provided strong indications that
gains are possible (for example, see recent papers [1,2,3,4]).

Inspired by these factors, the SuperSID project for the
exploitation of high-level information for high-performance
speaker recognition was undertaken as part of the 2002 JHU
Summer Workshop on Human Language Technology [5]. The
JHU WS2002 is one in a series of 6-week workshops hosted by
the CLSP group at JHU with the aim of bringing together
researchers to focus on challenging projects in the areas of
speech and language engineering. The authors of this paper
constituted the team members for the SuperSID project,
representing a diverse group of senior researchers from
academic, commercial, independent and Government research
centers, as well as graduate and undergraduate students. The aim
of the SuperSID project was to analyze, characterize, extract, and
apply high-level information to the speaker recognition task. The
goals were to develop new features and classifiers exploiting
high-level information, show performance improvements relative
to baselines on an established evaluation data set and task, and
demonstrate that the new features and classifiers provide
complementary information.

This paper provides an overview of the framework and overall
accomplishments of the SuperSID project. Details of the various
approaches undertaken in the project can be found in the
companion papers related to the SuperSID project [6,7,8,9,10] as
well as on the SuperSID website [11].

2.  TASK,  DATA  AND  TOOLS 

The focus for the SuperSID project was on text-independent
speaker detection using the extended data task from the 2001
NIST Speaker Recognition Evaluation [12]. This task was
introduced to allow exploration and development of techniques
that can exploit significantly more training data than is
traditionally used in NIST evaluations. Speaker models are
trained using 1, 2, 4, 8, and 16 complete conversation sides (where
a conversation side is nominally 2.5 minutes long) as opposed to
the normal 2 minutes of training speech used in other NIST
evaluations. A complete conversation side was used for testing.
The 2001 extended data task used the entire Switchboard-I
conversational telephone speech corpus. To supply a large
number of target and non-target trials and speaker models trained
with up to 16 conversations of training speech (~40 minutes), the
evaluation used cross-validation processing of the entire
corpus. The corpus was divided into 6 partitions of ~80 speakers
each. All trials within a partition involved models and test
segments from within that partition only; data from the other 5
partitions were available for background model building,
normalization, etc. The task consists of ~500 speakers with
~4100 target models (a speaker had multiple models for different
amounts of training data) and ~57,000 trials for the testing phase,
containing matched and mismatched handset trials and some
cross-sex trials. The cross-validation experiments were driven by
NIST's speaker model training lists and index files indicating
which models were to be scored against which conversation sides
for each partition.

Scores from each partition are pooled and a detection error
tradeoff (DET) curve is plotted to show system results at all
operating points. The equal error rate (EER), where the false
acceptance rate equals the missed detection rate, is used as a
summary performance measure for comparing systems.¹
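
As an illustration of the EER definition above, the following minimal sketch sweeps a decision threshold over pooled target and non-target scores and reports the point where the miss rate and the false acceptance rate are closest. The function name, the threshold sweep over observed scores, and the averaging of the two rates at the crossover are assumptions made for this example; they are not taken from the NIST scoring software.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the miss rate and the
    false acceptance rate are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for thr in thresholds:
        # Missed detection: a target trial scored below the threshold.
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            # With finitely many trials the two rates rarely match exactly,
            # so the sketch reports their average at the closest point.
            best = (gap, (p_miss + p_fa) / 2.0, thr)
    return best[1], best[2]


# Scores pooled over all partitions would be passed in as two flat lists.
eer, threshold = equal_error_rate([2.0, 3.0, 4.0, 0.5], [0.0, 1.0, -1.0, 2.5])
```

In this toy example one of four target trials falls below the chosen threshold and one of four non-target trials falls above it, so both error rates are 25%.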

The 2001 extended data task was selected for the project because
of the availability of several Switchboard-I annotated resources
providing features and measures related to high-level speaker
information.

•  SRI prosody database [13]: The SRI database provides
frame-level pitch and energy tracks (in raw and stylized forms)
as well as a wealth of word-level prosodic features derived both
for "truth" transcripts and for speech recognizer output,
time-aligned to the speech stream at the phone level. Features include
pause and segmental durations, voicing and stress information,
pitch statistics, and much more.

•  Four word transcriptions of varying word error rates
(WER): Manual transcripts from ISIP, automatic transcripts from
Dragon Systems (~20% WER), automatic transcripts from SRI's
Decipher (~30% WER), and automatic transcripts from BBN's
real-time Byblos (~50% WER).²

•  Two sets of open-loop (i.e., no language models in decoder)
phone transcripts in various languages: From MIT's PPRLM
system, we had phone transcripts in English, German, Japanese,
Mandarin, and Spanish. From CMU's GlobalPhone system, we
had phone transcripts in Chinese, Arabic, French, Japanese,
Korean, Russian, German, Croatian, Portuguese, Spanish,
Swedish, and Turkish.


¹ Due to the limited number of speakers/models, the results for the
16-conversation training condition were found to have high statistical
variation, so we will generally cite results only up to the 8-conversation
training condition.

² These automatic transcripts were selected to provide a range of WERs
and do not reflect fundamental differences in the suppliers' technology.


•  Articulatory feature transcripts [14]: (pseudo-)articulatory
classes automatically extracted from the speech signal and
designed to capture characteristics of speech production such as
consonantal place of articulation, manner of articulation, voicing,
etc.

We also assembled a suite of models to apply to features we
extracted from the above data sets. These included standard
n-gram tools found in the CMU-CU language modeling toolkit³ as
well as a "bag-of-n-grams" classifier as described in [2], a
discrete token binary tree classifier [7], a discrete HMM
classifier⁴, a continuous GMM classifier⁵, and an MLP fusion
tool⁶.

These models were used to form likelihood ratio detectors by
creating a speaker model using training data and a single
speaker-independent background model using data from the
held-out splits. For some systems a set of individual background
speaker models from the held-out set was used as cohort
models. During recognition, a test utterance is scored against the
speaker and background model(s) and the ratio (or in the log
domain, difference) is reported as the detection score for sorting.
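
The likelihood ratio detector described above can be sketched as follows. This is only an illustration under stated assumptions: a smoothed unigram model over discrete tokens stands in for the n-gram, HMM, and GMM models actually used, and the names train_unigram and detection_score, the additive smoothing constant, and the per-token length normalization are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram model (a stand-in for the real models)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}


def log_likelihood(model, tokens):
    """Log-likelihood of a token sequence; tokens must be in the vocabulary."""
    return sum(math.log(model[t]) for t in tokens)


def detection_score(speaker_model, background_model, test_tokens):
    """Log-domain likelihood ratio: speaker score minus background score.

    Dividing by the number of tokens is an assumed normalization so that
    test segments of different lengths give comparable scores.
    """
    llr = (log_likelihood(speaker_model, test_tokens)
           - log_likelihood(background_model, test_tokens))
    return llr / len(test_tokens)


vocab = ["a", "b"]
speaker = train_unigram(["a", "a", "a", "b"], vocab)               # target training data
background = train_unigram(["a", "b", "a", "b", "b", "a"], vocab)  # held-out data
score = detection_score(speaker, background, ["a", "a", "a"])
```

A positive score indicates that the test tokens are better explained by the speaker model than by the background model; trials are then sorted by this score to trace out the operating points.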

3.  APPROACHES 

In this section we survey some of the highlights of approaches
developed to exploit high-level speaker information. The reader
should consult the referenced papers for more details.

3.1  Acoustic  Features 

Although this project purposely avoided using standard acoustic
frame-level signal processing features such as cepstra, we wanted
to establish a baseline of standard approaches on the extended
data set. The acoustic system was a standard GMM-UBM system
using short-term cepstral-based features [15] with a 2048-mixture
UBM built using data from the Switchboard-II corpus. This
system produces an EER ranging from 3.3% for 1-conversation
training to 0.7% for 8-conversation training.

3.2  Prosodic  Features 

•  Pitch and Energy Distributions [10]: As a baseline, a simple
GMM classifier using a feature vector consisting of per-frame
log pitch, log energy and their first derivatives was developed,
which produced an EER of 16.3% for 8-conversation training.

•  Pitch and Energy Track Dynamics [10]: The aim was to
learn pitch and energy gestures by modeling the joint slope
dynamics of pitch and energy contours. A sequence of symbols
describing the pitch and energy slope states (rising, falling),
segment duration and phone or word context is used to train an
n-gram classifier. Using only slope and duration produced an
EER of 14.1% for 8-conversation training, which dropped to
9.2% when fused with the absolute pitch and energy
distributions, indicating it is capturing new information about the
pitch and energy features. Although not purely a prosodic
system, adding phone context to duration and contour dynamics
produces an EER of 5.2%. Examining pitch dynamics by
dynamic time warp matching of word-dependent pitch tracks
using 15 words or short phrases produced an EER of 13.3%.

³ http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
⁴ http://www.cfar.umd.edu/~kanungo/software/software.html
⁵ From MIT LL's GMM-UBM speaker recognition system
⁶ http://www.ll.mit.edu/IST/lnknet/

•  Prosodic Statistics [9]: Using the various measurements
from the SRI prosody database, 19 statistics from duration and
pitch related features, such as mean and variance of pause
durations and F0 values per word, were extracted from each
conversation side. Using these feature vectors in a K nearest
neighbor classifier on 8-conversation training produced an EER
of 15.2% for the 11 duration related statistics, 14.8% for the 8
pitch related statistics and 8.1% for all 19 features combined.
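
To make the pitch and energy track dynamics idea more concrete, the following sketch converts a pair of per-frame contours into a sequence of joint slope tokens with a coarse duration label. It is a simplified illustration, not the system of [10]: the slope is taken from raw frame-to-frame differences rather than from stylized contours, the short/long duration split and its threshold long_frames are hypothetical, and phone or word context is omitted.

```python
def slope_state(delta):
    """Map a frame-to-frame change to a rising ('+') or falling ('-') state."""
    return "+" if delta >= 0 else "-"


def slope_tokens(pitch, energy, long_frames=5):
    """Turn aligned pitch and energy contours into joint slope tokens.

    Each token is <pitch state><energy state><duration label>, for example
    '+-S' for rising pitch, falling energy, short segment. Consecutive
    frames sharing the same joint state are merged into one segment.
    """
    if len(pitch) != len(energy):
        raise ValueError("pitch and energy contours must be frame-aligned")
    tokens = []
    prev_state, run = None, 0

    def flush():
        # Emit the finished segment with an assumed two-way duration label.
        if prev_state is not None:
            tokens.append(prev_state + ("L" if run >= long_frames else "S"))

    for i in range(1, len(pitch)):
        state = (slope_state(pitch[i] - pitch[i - 1])
                 + slope_state(energy[i] - energy[i - 1]))
        if state == prev_state:
            run += 1
        else:
            flush()
            prev_state, run = state, 1
    flush()
    return tokens


tokens = slope_tokens([100, 110, 120, 115, 110, 105], [1, 2, 3, 4, 3, 2],
                      long_frames=2)
# tokens == ['++L', '-+S', '--L']
```

A token sequence of this kind could then be passed to an n-gram classifier in the same way as a phone or word stream.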

3.3  Phone  Features 

•  Phone N-grams [4]: In this approach the time sequence of
phones coming from a bank of open-loop phone recognizers is
used to capture some information about speaker-dependent
pronunciations. Multiple phone streams are scored independently
and fused at the score level. Using the 5 PPRLM phone streams
and the "bag-of-n-grams" classifier, an EER of 4.8% was
obtained for 8-conversation training.

•  Phone Binary Trees [7]: This approach also aims to model
the time sequence of phone tokens, but instead of an n-gram
model a binary tree model is used. With a binary tree, it is
possible to use large context without exponential memory
expansion, and the structure lends itself to some adaptation and
recursive smoothing techniques important for sparse data sets.
Using a 3-token history (equivalent to 4-grams) and adaptation
from a speaker-independent tree, an EER of 3.3% is obtained for
8-conversation training. The main improvement with this
approach is robustness for limited training conditions. For
example, it obtains an EER of 11% for 1-conversation training
compared to 33% for the n-gram classifier.

•  Cross-stream Phone Modeling [6]: While the above phone
approaches attempt to model phone sequences in the temporal
dimension, this approach examines capturing cross-stream
information from the multiple phone streams. The phone streams
are first aligned and then co-occurrence of the different language
phones is modeled via n-grams. This produces an EER of 4.0%
for 8-conversation training. Cross-stream and temporal systems
can be fused together to produce an EER of 3.6%. In general this
technique can be expanded using graphical models to
simultaneously capture both cross-stream and temporal sequence
information.

•  Pronunciation Modeling [8]: The aim here is to learn
speaker-dependent pronunciations by comparing constrained
word-level automatic speech recognition (ASR) phone streams
with open-loop phone streams. The phones from the SRI ASR
word transcripts are aligned at the frame level with the
PPRLM open-loop phones, and conditional probabilities for each
open-loop phone given an ASR phone are computed per speaker
and for a background model. For 8-conversation training this
simple technique produces an amazing 2.3% EER.
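
The pronunciation modeling idea can be sketched as follows: frame-aligned pairs of (ASR phone, open-loop phone) are counted to estimate the conditional probability of each open-loop phone given the ASR phone, once for the speaker and once for the background. The additive smoothing, the average log-ratio scoring rule, and the function names are assumptions made for illustration; the source states only that the conditional probabilities are computed per speaker and for a background model.

```python
import math
from collections import Counter


def conditional_model(aligned_pairs, open_loop_inventory, alpha=0.5):
    """Estimate P(open-loop phone | ASR phone) from frame-aligned pairs.

    aligned_pairs is a list of (asr_phone, open_loop_phone) tuples, one per
    frame. Returns a function prob(asr_phone, open_loop_phone).
    """
    pair_counts = Counter(aligned_pairs)
    asr_counts = Counter(asr for asr, _ in aligned_pairs)
    size = len(open_loop_inventory)

    def prob(asr, open_phone):
        # Additive smoothing (assumed) keeps unseen pairs from scoring zero.
        return ((pair_counts[(asr, open_phone)] + alpha)
                / (asr_counts[asr] + alpha * size))

    return prob


def pronunciation_score(speaker_prob, background_prob, test_pairs):
    """Average per-frame log ratio of speaker to background probabilities."""
    total = sum(math.log(speaker_prob(a, o)) - math.log(background_prob(a, o))
                for a, o in test_pairs)
    return total / len(test_pairs)


inventory = ["t", "d"]
# Toy data: this speaker tends to realize ASR /t/ as open-loop /d/.
speaker = conditional_model([("t", "d")] * 8 + [("t", "t")] * 2, inventory)
background = conditional_model([("t", "t")] * 8 + [("t", "d")] * 2, inventory)
score = pronunciation_score(speaker, background, [("t", "d")] * 3)
```

In this toy example the score is positive because the test frames show the same /t/ to /d/ substitution that dominates the speaker's training data but is rare in the background data.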

3.4  Lexical  Features 

Although not an active focus in the project, an n-gram idiolect
system like that described in [2] was implemented and used to
examine the effects of using errorful word transcripts. The
8-conversation training EERs for the different transcripts are as
follows: Manual 9%, Dragon 11%, SRI 12%, BBN 16%. So the
approach appears to be relatively robust even as WER increases
to 50%.

3.5  Conversational  Features 

In this approach, we examined whether there was speaker
information in turn-taking patterns and conversational style. The
motivation for this work comes from results in the 2002 NIST
evaluation where n-grams of speaker turn durations and word
density were able to produce an EER of 26% for 8-conversation
training. A system was developed using feature vectors
containing turn-based information about pitch, duration and rates
derived from the SRI prosody database. These feature vectors
were converted into a sequence of turn-based tokens from which
n-gram models were created to capture turn characteristics [9].
On split 1 for 8-conversation training the best system EER was
15.2%. We also examined conditional word usage in speaker
turns with the idea that a speaker may adapt his/her word usage
based on his/her conversational partner, but found this produced
an EER above 26%.

4.  FUSION 

Given the palette of new features and approaches outlined above,
we next set out to examine fusion of the different levels of
information to see if they are indeed providing complementary
information to improve performance. For the workshop we used
a simple single-layer perceptron with sigmoid outputs for fusing
system scores. A fuser was trained for each split using the five
held-out splits. There are no doubt better fusion approaches for
combining information sources, but the aim here was merely a
proof of concept. For the fusion experiment we selected the 9
best performing individual systems covering acoustic, prosodic,
phonetic and lexical approaches. The EERs for the individual
systems are shown in Table 1. After the GMM cepstra system, the
best performing system is the one based on pronunciation
modeling.
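
A single-layer perceptron with a sigmoid output, as used for score fusion above, can be sketched as follows. This is an illustrative reconstruction only: the workshop used an existing MLP tool, and the training criterion (cross-entropy minimized by batch gradient descent), the learning rate, the epoch count, and the function names here are assumptions. Each input vector holds the scores of the component systems for one trial, and the label is 1 for a target trial and 0 for a non-target trial.

```python
import math


def sigmoid(x):
    """Logistic function, clipped to avoid overflow in math.exp."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def fuse(weights, bias, scores):
    """Fused output in (0, 1) for one trial's vector of system scores."""
    return sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)


def train_fuser(score_vectors, labels, epochs=500, lr=0.5):
    """Fit the weights and bias by batch gradient descent on cross-entropy."""
    n = len(score_vectors[0])
    weights, bias = [0.0] * n, 0.0
    m = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(score_vectors, labels):
            err = fuse(weights, bias, x) - y  # gradient of the loss w.r.t. the logit
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Toy trials with scores from two hypothetical component systems.
train_x = [(2.0, 1.0), (1.5, 2.0), (1.0, 1.0),
           (-1.0, -2.0), (-2.0, -0.5), (-1.0, -1.0)]
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train_fuser(train_x, train_y)
fused = fuse(weights, bias, (1.5, 1.5))
```

Under the cross-validation scheme, one such fuser would be trained per split on trials from the five held-out splits and then applied to the trials of the remaining split.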


Table 1. The nine component systems to be fused. EERs
are from the 8-conversation training condition.

System                                                      EER (%)
1. Acoustic baseline (GMM-UBM cepstral features)                0.7
2. Pitch and energy distributions                              16.3
3. Pitch and energy slopes + durations + phone context          5.2
4. Prosodic statistics                                          8.1
5. Phone n-grams (5 PPRLM phone sets)                           4.8
6. Phone binary trees (5 PPRLM phone sets)                      3.3
7. Phone cross-stream + temporal (5 PPRLM phone sets)           3.6
8. Pronunciation modeling (SRI prons + 5 PPRLM phone sets)      2.3
9. Word n-grams (Dragon transcripts)                           11.0

In Figure 1 we show a DET plot with three curves from the
fusion experiment. The top two, with EER=0.7%, are for the
GMM cepstra system alone and for the fusion of all but the GMM
cepstra system (fuse 8). The fusion of all 9 systems produces the
bottom curve with EER=0.2%, a 71% relative reduction.
Based on the number of trials, this is a statistically significant
improvement. These results clearly show that the new features
and classifiers are supplying complementary information to the
baseline acoustic system.


Figure 1. DET plot (horizontal axis: false alarm probability, in %)
showing three curves: using only GMM-cepstra (EER=0.7%), fusing
8 systems without GMM-cepstra (EER=0.7%), and fusing all 9
systems (EER=0.2%).

We also conducted experiments examining fusing subsets of the
systems. The single best system to fuse with the GMM cepstral
system (system 1 in Table 1) is the pitch/energy slope system
(system 3), yielding an EER of 0.3%. It is intuitively appealing
to see that a system that covers both prosodic and phone
information was the best one to fuse with the standard acoustics.
The best two non-GMM-cepstral systems to fuse, with an EER of
1.2%, were the pronunciation (system 8) and pitch/energy slope
(system 3) systems. The best three non-GMM-cepstral system
combinations gave an EER of 0.9%. There were three
combinations that produced this EER: Systems (8, 4, 3), (8, 4, 9)
and (8, 3, 9). In each case the pronunciation system (8) is
included, with the addition of two of the pitch/energy slope (3),
prosodic statistics (4), and word n-gram (9) systems. The sampling
of different levels of information in these combinations is also
intuitively appealing and again confirms that the systems are
indeed providing complementary information.

5.  CONCLUSIONS  AND  FUTURE  DIRECTIONS

From the results presented in this paper and in the companion
papers, it is clear that the SuperSID project achieved the aim of
exploiting high-level information to improve speaker recognition
performance. Even at extremely low error rates, it was shown
that there is still significant benefit in combining complementary
types of information.


However, this is just the beginning of truly exploiting these
sources of speaker information, with many open avenues to
explore. First, the results need to be validated on a different
corpus to show they indeed generalize. Work is currently
underway to implement these approaches on the Switchboard-II
corpus, which has a higher acoustic error rate. Second, we need
to expand our error analysis to understand which errors are left
and what features can address them. Third, we need to examine
better ways of feature selection and combination, perhaps
incorporating confidence measures to know when different types
of features/systems are reliable. Finally, we need to examine the
relative robustness of the knowledge sources to factors like
noise, channel variability, speaking partners, topics and
language.